This workbook will take you through step-by-step how to conduct the downstream analysis of your single-cell data in Seurat.
After you have done the tutorial, you can use this same workbook as a template for any future analyses that you run. However, one should bear in mind that you will need to ensure that you account for subtle differences in the data. See “Loading data into Seurat: Things to consider” in the provided presentation for more information.
/ to \. Please try and run the code first. If you have any issues then please adpat the code.To use the software packages we need in R, first you will have to load the packages. The library() function calls the package you want a loads it into the environment.
#Loading packages necessary for tutorial
library(Seurat)
library(ggplot2)
library(cowplot)
library(dplyr)
This involves slightly different steps depending on the scRNA-seq technology and pre-processing steps taken to get your counts per gene matrix. Here you will go through two examples:
data_10x <- Read10X(data.dir = "data/10X/filtered_gene_bc_matrices/")
setwd() to set the working directory to the folder you downloaded. For example this might be like: setwd("/Users/name/Downloads/EI-Single-Cell-Course")The Read10X() function reads in the output of the Cellranger pipeline from 10X, returning a unique molecular identified (UMI) count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column).
If however, you wanted to use a plate-based method you may find that the data is in a matrix format. In this case, the row names should be the gene names and the columns the cell ID.
NB: the following code has been commented out by placing a hashtag (#) before of it. This means that when you try and run this block it will not run. We wanted to give you an example of how to load in plate-based data (like that you generated in your previous tutorials) however, for this session we will continue looking at the reference 10X dataset.
# ss2_matrix <- read.table(file = 'data/ss2/ss2_raw_matrix.tsv', header = T, row.names = "external_gene_name")
# seurat_object <- CreateSeuratObject(counts = ss2_matrix, project = "scRNAseq_2021")
The Seurat object serves as a container that contains both your data (the matrix) and analysis you will perform for your single-cell dataset. You can assign this whatever name you’d like, making sure it contains no spaces. To do this, we provide our matrix/loaded 10X data in counts parameter, and give the project a title in the project parameter.
seurat_object <- CreateSeuratObject(counts = data_10x, project = "scRNAseq_2021")
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
You may see the following warming message: Show in New WindowClear OutputExpand/Collapse Output Feature names cannot have underscores ('_'), replacing with dashes ('-'). If you do then do not panic or worry. This is just Seurat helping you out and cleaning you data for you!
If this has worked, when running seurat_object you should see this message pop up along with some more information.
seurat_object
## An object of class Seurat
## 32738 features across 2700 samples within 1 assay
## Active assay: RNA (32738 features, 0 variable features)
Seurat allows you to easily explore QC metrics and filter cells. We will use the following criteria:
The number of unique genes detected in each cell. Low-quality cells or empty droplets will often have very few genes. Cell doublets or multiplets may exhibit an aberrantly high gene count
The total number of molecules detected within a cell (correlates strongly with unique genes)
The percentage of reads that map to the mitochondrial genome. Low-quality / dying cells often exhibit extensive mitochondrial contamination
PercentageFeatureSet() function is used to calculate what percentage of counts originates from a set of mitochondrial features (features and genes are used interchangibly). We need to add this metric into the meta data of our object before doing anything elseseurat_object[["percent.mt"]] <- PercentageFeatureSet(seurat_object, pattern = "^MT-")
Now that you have calculated the percentage mitochondrial you can see that this has been added to your seurat_object. You can view this data in the meta.data table.
Question: Try showing the meta-data of the top 10 cells of our Seurat object. Hint, change the parameter of n.
head(seurat_object@meta.data, n=5)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1 scRNAseq_2021 2421 781 3.0152829
## AAACATTGAGCTAC-1 scRNAseq_2021 4903 1352 3.7935958
## AAACATTGATCAGC-1 scRNAseq_2021 3149 1131 0.8891712
## AAACCGTGCTTCCG-1 scRNAseq_2021 2639 960 1.7430845
## AAACCGTGTATGCG-1 scRNAseq_2021 981 522 1.2232416
Tip: Remember we can use the function head() to see the first few rows of our dataset, and to select the meta data ‘slot’ of our seurat object we use seurat_object@meta.data$percent.mt which will alone return the values for percent.mt.
Here you will generate your first plot of the project! This will tell you some initial information about the dataset. This is very similar to what you have done before on Galaxy, but just using the Seurat package. To do this, we will generate a violin plot using the VlnPlot() function. This plotting function is very versatile as you later on in the workbook.
To do this we use:
VlnPlot(seurat_object, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
Question: What do you immediately notice about your dataset? What do you think this means about your dataset?
FeatureScatter is typically used to visualize feature-feature relationships, but can be used for anything calculated by the object, i.e. columns in object metadata, PC scores etc. This plotting function can also be very useful if you wanted to compare meta-data or particular cell populations later on in your analysis. See the previous selection to see how to access your meta data.
First you can plot the number of counts against the calculated percentage of mitochondrial:
n_counts_vs_percent_mito_plot <- FeatureScatter(seurat_object, feature1 = "nCount_RNA", feature2 = "percent.mt")
n_counts_vs_percent_mito_plot
You can also plot the number of counts against the number of features (genes):
n_counts_vs_n_features_plot <- FeatureScatter(seurat_object, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
n_counts_vs_n_features_plot
Now we can combine these scatter plots together for easy of plotting. Note you can use the + operator to combine two plot objects.
n_counts_vs_percent_mito_plot + n_counts_vs_n_features_plot
Question: What is the relationship between the number of counts and the number of features(genes) in the cells?
Here are a couple of other commands that come in handy to get information about your dataset:
# To get the average overall counts per cell:
mean(seurat_object@meta.data$nCount_RNA)
## [1] 2366.9
# To get the average number of genes:
mean(seurat_object@meta.data$nFeature_RNA)
## [1] 846.9941
Choosing a cutoff or selection criteria is usually very specific to your analysis and the biological question you are asking. However, the plots generated above (the two feature scatter plots, and your violin plots) you can see where the majority of our data points are. Use these to choose your cut-off values for features, counts, and percent mitochondrial.
Remembering that:
Low-quality cells or empty droplets will often have very few genes
Cell doublets or multiplets may exhibit an aberrantly high gene count
Low-quality / dying cells often exhibit extensive mitochondrial contamination
‘Subset’ allows you to achieve this in one command using operators (see the slides for more information). From the above we have selected the following values:
nFeature_RNA) on your console and see what effect it has on the number of cells removed. Once you are happy with your chosen cut offs, you then assign them to your object.seurat_object <- subset(seurat_object, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
seurat_object
## An object of class Seurat
## 32738 features across 2638 samples within 1 assay
## Active assay: RNA (32738 features, 0 variable features)
As you can see, compared to the original sample size of 2700 cells using this subsetting criteria you have managed to remove cells.
Question: How many cells did you subset out of your dataset?
After removing unwanted cells from the dataset, the next step is to normalize the data. By default, Seurat employs a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor, and log-transforms the result. The scale factor by default is set to 10,000.
In this case, you can just copy the below line and it will do this all for you!
seurat_object <- NormalizeData(seurat_object, normalization.method = "LogNormalize", scale.factor = 10000)
? feature is a really useful tool that R gives you to find out what inputs a function requires.Question: Using ?NormalizeData on the console, can you find out what some other default parameters are.
You next need to calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in other cells). You will normally find that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.
We can do this by implementing the FindVariableFeatures() function. Use the below function to find the top 2000 variable features in the dataset.
seurat_object <- FindVariableFeatures(seurat_object, selection.method = "vst", nfeatures = 2000)
Using head(), you can show the top 10 variable features in your dataset, and assign this to a variable called top10. You can change the value of n to show however many features you would like to inspect.
top10 <- head(VariableFeatures(seurat_object), n=10)
top10
## [1] "PPBP" "LYZ" "S100A9" "IGLL5" "FTL" "GNLY" "PF4" "FTH1"
## [9] "GNG11" "S100A8"
You can now plot the highly variable genes by plotting the average expression against the standised variance. The top 2000 most variable genes will be highlighted in Red while the rest will be black. The number of genes in each category is displayed next to the label on the legend.
var_feat_plot <- VariableFeaturePlot(seurat_object)
var_feat_plot
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 16159 rows containing missing values (geom_point).
To go one step further, you can label the top 10 genes you already pulled out of the dataset. By adding data labels you can easily see these genes. Seurat has a helpher function to produce this plot called LabelPoints which we can use alongside the top 10 genes list to label them on the plot.**
var_feat_plot_labeled <- LabelPoints(plot = var_feat_plot, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
var_feat_plot_labeled
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 16159 rows containing missing values (geom_point).
Next, you apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData() function:
This step gives equal weight in downstream analyses, so that the most highly-expressed genes do not dominate the rest of the analysis.
(This might take a little while, don’t panic!)
all_genes <- rownames(seurat_object)
seurat_object <- ScaleData(seurat_object, features = all_genes)
## Centering and scaling data matrix
Next you perform PCA on the scaled data. By default, only the previously determined highly variable features are used as input (here we are going forward with only the top 2000 most variable genes).
Run PCA on your Seurat object using the below command. This might print out a large log of gene names - don’t worry, that means it has worked!
seurat_object <- RunPCA(seurat_object, features = VariableFeatures(object = seurat_object))
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP
## FCER1G, CFD, LGALS1, CTSS, S100A8, LGALS2, SERPINA1, IFITM3, SPI1, CFP
## PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, LGALS3, GSTP1, PYCARD, NCF2
## Negative: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, STK17A, CTSW, CD247
## GIMAP5, AQP3, CCL5, TRAF3IP3, GZMA, CST7, MAL, ITM2A, HOPX, MYC
## GIMAP7, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3, LYAR, SAMD3
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
## HLA-DMA, HLA-DPB1, HLA-DQA2, HLA-DRB5, HLA-DPA1, HLA-DMB, FCRLA, HVCN1, LTB, BLNK
## P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, SMIM14, PPP1R14A, C16orf74, MZB1, RP11-428G5.5
## Negative: NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2
## GZMH, CCL4, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX
## TTC38, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, ACTB, APOBEC3G
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPA1, HLA-DPB1, CD74, MS4A1, HLA-DRB1, HLA-DRA
## HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, HVCN1, MALAT1, FCRLA, IRF8
## PLAC8, BLNK, SMIM14, PLD4, P2RX5, LAT2, IGLL5, SWAP70, ARHGAP24, FGFBP2
## Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, HIST1H2AC
## CLU, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1
## F13A1, NGFRAP1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, MYL9, RP11-367G6.3, GP1BA
## PC_ 4
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, MS4A1, CD74, HLA-DPB1, HLA-DPA1, HLA-DRB1, TCL1A
## HLA-DQA2, HLA-DRA, LINC00926, HLA-DRB5, GZMB, HIST1H2AC, HLA-DMA, HVCN1, HLA-DMB, FCRLA
## PF4, SDPR, FGFBP2, FCGR3A, PPBP, GNG11, PRF1, NKG7, PLAC8, GNLY
## Negative: VIM, IL7R, S100A6, S100A8, IL32, S100A4, GIMAP7, S100A9, S100A10, MAL
## AQP3, FYB, CD2, CD14, GIMAP4, LGALS2, ANXA1, RBP7, FCN1, S100A12
## LYZ, TMSB4X, GIMAP5, S100A11, MS4A6A, FOLR3, TRABD2A, AIF1, IL8, NELL2
## PC_ 5
## Positive: GZMB, FGFBP2, NKG7, S100A8, GNLY, CCL4, CST7, PRF1, GZMA, SPON2
## GZMH, S100A9, CCL3, LGALS2, CTSW, XCL2, CLIC3, S100A12, CCL5, RBP7
## CD14, MS4A6A, GSTP1, AKR1C3, IGFBP7, TYROBP, TTC38, FOLR3, XCL1, HOPX
## Negative: LTB, IL7R, VIM, CKB, AQP3, MS4A7, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1
## PTGES3, MAL, HN1, CD2, LILRB2, GDI2, CORO1B, ANXA5, TUBA1B, FAM110A
## PPA1, TRADD, ATP1A1, ABRACL, IL32, FYB, WARS, CTD-2006K23.1, TRAF3IP3, GPR183
Now if you inspect your Seurat object again you can see that you have 1 dimensional reduction calculated:. This means that you have added the results from your PCA reduction to the Seurat object.
seurat_object
## An object of class Seurat
## 32738 features across 2638 samples within 1 assay
## Active assay: RNA (32738 features, 2000 variable features)
## 1 dimensional reduction calculated: pca
Question: Can you show the top 5 features for the first 3 principal components?
You can see the first n features in x: y of your PCA using. i.e. seurat_object[["pca"]], dims = x:y, nfeatures = n). This command will show the top n features that are both positively and negatives contributing to the reduction.
print(seurat_object[["pca"]], dims = 1:15, nfeatures = 10)
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP
## Negative: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, STK17A, CTSW, CD247
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
## Negative: NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPA1, HLA-DPB1, CD74, MS4A1, HLA-DRB1, HLA-DRA
## Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, HIST1H2AC
## PC_ 4
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, MS4A1, CD74, HLA-DPB1, HLA-DPA1, HLA-DRB1, TCL1A
## Negative: VIM, IL7R, S100A6, S100A8, IL32, S100A4, GIMAP7, S100A9, S100A10, MAL
## PC_ 5
## Positive: GZMB, FGFBP2, NKG7, S100A8, GNLY, CCL4, CST7, PRF1, GZMA, SPON2
## Negative: LTB, IL7R, VIM, CKB, AQP3, MS4A7, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1
## PC_ 6
## Positive: FCGR3A, CKB, MS4A7, RP11-290F20.3, SIGLEC10, HMOX1, VMO1, PPM1N, CTD-2006K23.1, MALAT1
## Negative: FCER1A, CLEC10A, GPX1, ID1, LGALS2, SERPINF1, MS4A6A, VIM, CACNA2D3, CD14
## PC_ 7
## Positive: GZMK, CCL5, KLRG1, LYAR, IL32, S100A4, GZMA, B2M, JAKMIP1, TIGIT
## Negative: IGFBP7, AKR1C3, SPON2, GZMB, PLAC8, CLIC3, HAVCR2, S100B, GNLY, FGFBP2
## PC_ 8
## Positive: RBP7, S100A12, S100A8, LINC00926, FOLR3, CD79A, IFI35, CD14, KLF6, S100A9
## Negative: FCER1A, CLEC10A, ENHO, SERPINF1, CD1C, RP6-91H8.3, CACNA2D3, PLD4, GAS6, PPP2R3C
## PC_ 9
## Positive: STMN1, ACTB, PCNA, KIAA0101, PTTG1, MYL9, RRM2, MZB1, IGJ, TYMS
## Negative: GZMK, IFIT1, LYAR, TNFSF10, HEXIM2, GIMAP4, IL7R, GTPBP2, TREX1, IFI6
## PC_ 10
## Positive: KLHL6, NFE2, ADA, TIMMDC1, FOLR3, ARL2BP, DHRS7, EPN1, MGAT1, LMAN2
## Negative: IFIT1, ISG15, MX1, IFI6, GBP1, UBE2L6, EPSTI1, OAS1, IFI35, IFI44
## PC_ 11
## Positive: HSPB11, PTTG1, STMN1, PCNA, GZMH, JAKMIP1, APOBEC3H, KIAA0101, LIG1, TYMS
## Negative: CRIP2, HAVCR2, LTB, ANXA1, HOPX, TRADD, KLF6, AHNAK, S100A4, S100A10
## PC_ 12
## Positive: XCL1, KLRC1, XCL2, HOPX, GZMK, SPTSSB, SRM, NCR3, ID2, CAPN12
## Negative: GZMH, TIGIT, FGFBP2, CCL4, ITGB7, TMEM116, S100A4, S100A10, CCL4L1, TBPL1
## PC_ 13
## Positive: MZB1, PPP1R14B, IGJ, PLD4, SERPINF1, LILRA4, IRF7, PPP1R14A, CLEC4C, PTGDS
## Negative: CD9, PLBD1, SH3BGRL, TBC1D15, STK17A, LEPROT, RAB4B, RTN4, ARRDC3, GZMH
## PC_ 14
## Positive: MAPK1IP1L, CHI3L2, ARL4A, MIS18A, RGS1, TYW5, SIAH2, METTL13, KLF6, RARA
## Negative: IRF7, PLBD1, TIMMDC1, RCHY1, IGJ, TAF1, TMEM165, FAAH, ATP6AP1, CES4A
## PC_ 15
## Positive: XCL2, XCL1, ATXN7L3B, YIPF3, TIGIT, FNTA, PI4KB, ARID5B, PITPNA-AS1, SEPHS2
## Negative: NCR3, FAM43A, GYG1, APOBEC3H, COMMD5, RCL1, TYMS, PCNA, MRPL14, CAPN12
Feel free to play around with different PCs this might come in handy later on in your analysis!
Seurat provides several useful ways of visualizing both cells and features that define the PCA, including VizDimReduction(), DimPlot(), and DimHeatmap()
You can visualise the top genes associated with each principle reduction components e.g. for the first 2 components using VizDimReduction().
VizDimLoadings(seurat_object, dims = 1:2)
Or alternatively by added some additional paramters you can view the first 6 components and their top 10 genes. The larger the loading (i.e. the more positive) the greater the variance and therefore the effect that gene has on that component.
VizDimLoadings(seurat_object, dims = 1:6, reduction = "pca", ncol = 3, nfeature=10)
Question: Which PCs show the most interesting loadings?
Another way to visualise the dimensional reduction produced by the PCA is to plot the components on a 2D scatter plot where each point is a cell and it’s positioned based on the cell embeddings determined by the reduction technique. By default, cells are colored by their identity class (can be changed with the group.by parameter if you have the relavent data in the your meta.data).
Here is an example for the first 2 components:
DimPlot(seurat_object, reduction = "pca", dims = 1:2)
Draws a heatmap focusing on a principal component. Both cells and genes are sorted by their principal component scores. Allows for nice visualization of sources of heterogeneity in the dataset.
e.g. for PC 1, and how to show multiple heatmaps on the same plot.
Question: Trying investigation different dimensions by change the number after the dim = parameter
DimHeatmap(seurat_object, dims = 1, cells = 500, balanced = TRUE)
You can also plot multiple componets heatmaps in one plot.
DimHeatmap(seurat_object, dims = 1:6, cells = 500, balanced = TRUE, ncol = 3)
To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset. However, how many components should you choose to include? 10? 20? 100?
To help determine this you can use an ‘Elbow Plot’, a method of ranking of principle components based on the percentage of variance explained by each PC. To generate the plot you can use the function: ElbowPlot()
ElbowPlot(seurat_object)
Question: In your plot you should observe an ‘elbow’ where the curve plateaues, a PC from which there is little change in standard deviation and so including them wouldn’t reveal much more about the dataset. From the results of your elbow plot, how many PCs are you going to use for your clustering and why?
Seurat applies a graph-based clustering approach, a method whereby cells are embedded in a graph structure with edges drawn between cells with similar feature expression patterns, and then attempts to partition this graph into highly interconnected ‘communities’. To cluster the cells (grouping cells together) we use the FindClusters() function.
First you construct a WNN (weighted nearest neighbours) graph based on the distance in PCA space, and refine the edge ‘weights’ between any two cells based on the shared overlap in their local neighborhoods. This step is performed using the FindNeighbors() function, and takes as input the previously defined dimensionality of the dataset (e.g. first 10 PCs).
This is where your principal component analysis is important!
To do: Using your results from the PCA section set the number of dimensions for the FindNeighbours() function to use:
seurat_object <- FindNeighbors(seurat_object, dims = 1:10)
This function contains a parameter called resolution which sets the ‘granularity’ of the clustering.
seurat_object <- FindClusters(seurat_object, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 95888
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8758
## Number of communities: 9
## Elapsed time: 0 seconds
To do: Look at cluster IDs of the first 5 cells eg. head(Idents(object), x)
Running this command will display the first 10 cells in your data with their newly assigned clusters (currently represented as numbers) that the louvain algorithm has decided the cell belongs too. Across the top will be the cell (either the barcode or the read name) and below it will be the cluster it is assinged to. At the very bottom you will see Levels:... this shows an overview of all the clusters found in the dataset.
head(Idents(seurat_object), 10)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## 0 3 2 5
## AAACCGTGTATGCG-1 AAACGCACTGGTAC-1 AAACGCTGACCAGT-1 AAACGCTGGTTCTT-1
## 6 2 4 4
## AAACGCTGTAGCCA-1 AAACGCTGTTTCTG-1
## 4 5
## Levels: 0 1 2 3 4 5 6 7 8
Question: Try running FindClusters() at different resolutions by changing the value of resolution, how does changing the resolution affect the function?
Seurat offers several non-linear dimensional reduction techniques, such as tSNE or in our case UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying ‘manifold’ of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots.
seurat_object <- RunUMAP(seurat_object, 1:10)
You can visualize this with DimPlot the same as PCA, but now you define the reduction to plot like this:
DimPlot(seurat_object, reduction = "umap")
To do: Now that you can visualize your UMAP clusters, go back to where you ran FindNeighbours() and have a go again with a lower or higher resolution (keeping dims the same). You should see how the resolution affects the function more clearly.
Question: This is really important. What do you think would be impact to the conclusions you may draw from your experiment of using a resolution that is too low?
?DimPlot to see which things you can modify. E.g. you can change the colour scheme, the point size, the factor to colour the points by etc!*Here is the same UMAP coloured instead by the cell phenotypes. In this case we only have one item in the meta.data which we can plot. So we just display what project the data is from.
DimPlot(seurat_object, reduction = "umap", group.by = "orig.ident")
At this point in the analysis, particularly with larger datasets you might want to save you data. Just incase your computer crashes or if you forget to save your progress! An amazing feature in R, is that you can always save your Seurat Object. This is really useful for collaboration and sharing the data as well. To do this you can use the following command: saveRDS(seurat_object, file = 'seurat_object.rds')
Seurat can help you find markers (cluster/cell markers) that define clusters via differential expression. By default, it identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers() automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.
To find markers for every cluster compared to all remaining cells, and report only the positive ones you can use the below command. Note: This takes a little while to calculate, and will vary depending on the number of clusters you have.
seurat_markers <- FindAllMarkers(seurat_object, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
Next we can extract the highest genes per cluster by their average log fold change (avg_log2FC in the data frame). You don’t need to pay too much attention on the numbers themselves for the purposes of our project, but ultimately the p value describes how significantly different the expression of that gene is in the cluster compared to the other clusters (ie. higher PC value, the more statistically significant the difference). And the FC describes how different the expression level is between a gene in one cluster compared to the others
Tip: you need to have loaded library(dplyr) to run the below but you have already imported this package at the start of the tutorial
The %>% is a magical operator in R which allows use to put lots of functions together. We are using it here to try and make the code a shorter and easier to follow.
top_2_markers <- seurat_markers %>%
group_by(cluster) %>%
slice_max(n = 2, order_by = avg_log2FC)
top_2_markers
## Registered S3 method overwritten by 'cli':
## method from
## print.boxx spatstat.geom
## # A tibble: 18 × 7
## # Groups: cluster [9]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 5.25e- 83 1.32 0.44 0.111 1.72e- 78 0 CCR7
## 2 3.10e-113 1.10 0.913 0.592 1.01e-108 0 LDHB
## 3 0 5.54 0.996 0.216 0 1 S100A9
## 4 0 5.44 0.971 0.123 0 1 S100A8
## 5 2.00e- 82 1.24 0.982 0.646 6.55e- 78 2 LTB
## 6 1.45e- 53 1.22 0.415 0.115 4.76e- 49 2 AQP3
## 7 0 4.31 0.936 0.041 0 3 CD79A
## 8 9.48e-271 3.59 0.622 0.022 3.10e-266 3 TCL1A
## 9 8.57e-194 3.15 0.588 0.046 2.81e-189 4 GZMK
## 10 6.94e-211 3.10 0.96 0.226 2.27e-206 4 CCL5
## 11 2.20e-184 3.31 0.981 0.135 7.21e-180 5 FCGR3A
## 12 2.61e-123 3.00 1 0.315 8.56e-119 5 LST1
## 13 2.30e-262 4.89 0.993 0.072 7.51e-258 6 GZMB
## 14 2.20e-172 4.88 0.964 0.136 7.19e-168 6 GNLY
## 15 3.75e-251 3.87 0.829 0.01 1.23e-246 7 FCER1A
## 16 8.78e- 23 2.80 1 0.512 2.87e- 18 7 HLA-DPB1
## 17 3.68e-110 8.57 1 0.024 1.21e-105 8 PPBP
## 18 7.73e-200 7.24 1 0.01 2.53e-195 8 PF4
Below is an example if we wanted to find the differentially expressed genes between two cluster of interest. In this example we are looking at our cluster 0 and cluster 2. As you can see from the UMAP above, they are next to each other, which could suggest that they have similar transcriptomic profiles. Using FindMarkers we can investigate further into why they were put into different clusters.
cluster_0_vs_cluster_2 <- FindMarkers(seurat_object, ident.1 = 0, ident.2 = 2, min.pct = 0.25)
head(cluster_0_vs_cluster_2, n=5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## S100A4 1.297339e-66 -1.3291832 0.684 0.954 4.247230e-62
## B2M 1.193567e-46 -0.3785696 1.000 1.000 3.907499e-42
## MALAT1 7.485856e-37 0.4668175 1.000 1.000 2.450719e-32
## RPL32 6.695351e-35 0.3381705 0.999 1.000 2.191924e-30
## ANXA1 8.680507e-32 -0.9767680 0.493 0.802 2.841824e-27
If we wanted to compare between 2 clusters, we would need to specify the idents. e.g. ident.1 = 3, ident.2 = 7. If you were doing the same against the
We can then start exploring the expression of these markers using: - DoHeatmap() - VlnPlot() - FeaturePlot() - DotPlot()
To do: A great way to initially visualize the top markers for all clusters is through a heatmap. This might not look perfectly polished but its a good starting point to compare all the top markers across your clusters.
top_10_markers <- seurat_markers %>%
group_by(cluster) %>%
slice_max(n = 10, order_by = avg_log2FC)
DoHeatmap(seurat_object, features = top_10_markers$gene) + NoLegend()
As in this case we know we are working with PBMC cells, so we know what cell-types to expect in the sample. We can do some litature research and find known marker genes for cells we might expect to see in our sample. The following code chunk adds both these markers from the literature (know_marker_genes) and the top 2 marker genes which we have extracted from our data (extracted_marker_genes).
know_marker_genes <- c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")
extracted_marker_genes <- top_2_markers$gene
A feature plot shows the expression level of each gene within a cell and overlaps this onto the UMAP. The default implemntation of FeaturePlot() can be seen for our know_marker_genes.
FeaturePlot(seurat_object, features=know_marker_genes)
However, in some cases this plot can be misleading. By changing the parameters, order to bring any cell with the value to front, and label to make it easier to see which cluster a cell is in. You can see that from a little bit of trail and error you can extract some more information from your plots. This is particularly the case for the CD8A+ cells.
FeaturePlot(seurat_object, features=know_marker_genes, order = TRUE, label = TRUE) & NoLegend()
We can also look at the expression probability distributions across the clusters using the VlnPlot() function. Here we have a look at some of top genes we extracted using FindAllMarkers()
VlnPlot(seurat_object, features = extracted_marker_genes[1:6])
VlnPlot(seurat_object, features = extracted_marker_genes[7:12])
You can also look up your “favourite genes” by adding it. You can add multiple query genes by combing them in a vector c("gene1", "gene2", ...).
Using the code block below. Search for some genes of interest! If none come to mind, scroll-up to where you did you PCA analysis. And use some the top gene from here.
# VlnPlot(seurat_object, features = c("CD8A", "CD8B")
VlnPlot(seurat_object, features = "CD8B")
Finally you can use a DotPlot(). This plot is a good way to visualise how feature expression changes across different identity classes (clusters). The size of the dot encodes the percentage of cells within a class, while the color encodes the average expression level across all cells within a class (blue is high). In the example below we plot the expression of some well-known CD genes.
DotPlot(seurat_object, features = c("CD247", "CD3E", "CD9"))
From everything we have done above and markers found in the literature we can use the following markers to assign cell-types to the clusters. The table below outlines our decision:
| Cluster ID | Markers | Cell Type |
|---|---|---|
| 0 | IL7R, CCR7 | Naive CD4+ T |
| 1 | CD14, LYZ | CD14+ Mono |
| 2 | IL7R, S100A4 | Memory CD4+ |
| 3 | MS4A1 | B |
| 4 | CD8A | CD8+ T |
| 5 | FCGR3A, MS4A7 | FCGR3A+ Mono |
| 6 | GNLY, NKG7 | NK |
| 7 | FCER1A, CST3 | DC |
| 8 | PPBP | Platelet |
To assign new cell names to our clusters we create a list of new the new cell names in order of the of the cluster you want to rename. I.e. if you wanted to rename Cluster 0 then the first element in the list would be assigned to cluster 0 and so on.
seurat_clusters assignment as it is saved to the seurat_object@meta.data$seurat_clusters. To revert back to the seurat_clusters numbers use the following command Idents(seurat_object) <- seurat_object@meta.data$seurat_clusters.new_cluster_ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new_cluster_ids) <- levels(seurat_object)
seurat_object <- RenameIdents(seurat_object, new_cluster_ids)
DimPlot(seurat_object, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
We can then look into how many cells we have in each cluster and you can also calculate this as a proportion.
# How many cells are in each cluster
table(Idents(seurat_object))
##
## Naive CD4 T CD14+ Mono Memory CD4 T B CD8 T FCGR3A+ Mono
## 686 479 455 344 325 160
## NK DC Platelet
## 140 35 14
# What proportion of cells are in each cluster?
prop.table(table(Idents(seurat_object)))
##
## Naive CD4 T CD14+ Mono Memory CD4 T B CD8 T FCGR3A+ Mono
## 0.260045489 0.181576952 0.172479151 0.130401820 0.123199393 0.060652009
## NK DC Platelet
## 0.053070508 0.013267627 0.005307051
Lets Inspect our Seurat Object for a final time. First we can look at the main object:
seurat_object
## An object of class Seurat
## 32738 features across 2638 samples within 1 assay
## Active assay: RNA (32738 features, 2000 variable features)
## 2 dimensional reductions calculated: pca, umap
Then we can see the meta data that has been added to the object. This includes to the QC results and the clustering.
head(seurat_object@meta.data)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1 scRNAseq_2021 2421 781 3.0152829
## AAACATTGAGCTAC-1 scRNAseq_2021 4903 1352 3.7935958
## AAACATTGATCAGC-1 scRNAseq_2021 3149 1131 0.8891712
## AAACCGTGCTTCCG-1 scRNAseq_2021 2639 960 1.7430845
## AAACCGTGTATGCG-1 scRNAseq_2021 981 522 1.2232416
## AAACGCACTGGTAC-1 scRNAseq_2021 2164 782 1.6635860
## RNA_snn_res.0.5 seurat_clusters
## AAACATACAACCAC-1 0 0
## AAACATTGAGCTAC-1 3 3
## AAACATTGATCAGC-1 2 2
## AAACCGTGCTTCCG-1 5 5
## AAACCGTGTATGCG-1 6 6
## AAACGCACTGGTAC-1 2 2
Finally, we can save the output. This will save everything analysis (not the plots however) to an object that you can share with others or reload yourself the next time you would like inspect this data.
saveRDS(seurat_object, file = 'ei_10X_genomics_analysis.rds')